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derlying functions are most often defined on a continuous state space and 
can be unbounded. We consider a regenerative setting and Monte Carlo 
estimators based on i.i.d. blocks of a Markov chain trajectory. The main 
result is an inequality for the mean square error. We also consider confi- 
dence bounds. We first derive the results in terms of the asymptotic vari- 
ance and then bound the asymptotic variance for both uniformly ergodic 
and geometrically ergodic Markov chains. 
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1. Introduction 

Suppose that we are to estimate the expectation of a function, possibly un- 
bounded and defined on a high dimensional space, with respect to some proba- 
bility density which is known only up to a normalising constant. Such problems 
arise in Bayesian inference and arc often solved using Markov chain Monte Carlo 
(MCMC) methods. The idea is to simulate a Markov chain converging to the 
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target distribution and take ergodic averages as estimates of the expectation. It 
is essential to have explicit and reliable bounds which provide information about 
how long the algorithms must be run to achieve a prescribed level of accuracy 
(c.f. [54, 25, 28]). 

We consider MCMC algorithms which use independent and identically dis- 
tributed random blocks of the underlying Markov chain, each block starting and 
ending at consecutive regeneration times. In fact, we propose a sequential version 
of regenerative estimator, for which the length of trajectory is "nearly fixed". 
This methodology is a promising alternative to both fixing the total length of 
the trajectory and fixing the number of regeneration cycles [45, 54, 26, 8, 9, 10]. 
The simulation scheme is easy to implement, provided that the regeneration 
times can be identified. We introduce our estimator and discuss its properties 
in Section 3. 

The regenerative/sequential simulation scheme which we propose allows us 
to use directly the tools of the renewal theory and statistical sequential analysis. 
Our goal is to obtain quantitative bounds on the error of MCMC estimation. 
We aim at explicit nonasymptotic results. To this end we split the analysis into 
independent parts. 

First, in Section 3, we derive inequalities on the mean square error (MSE) in 
terms of the asymptotic variance of the chain. This is obtained under very weak 
assumptions. We require a one step minorization condition (Assumption 2.1) 
and an intcgrability conditions that are essentially equivalent to those needed 
for Central Limit Theorems (CLTs) for nontrivial target functions. The proof 
of our main result, Theorem 3.3, depends on a classical result of Lorden [38] 
about the mean "excess time" for renewal processes and also on the two Wald's 
identities. 

Next, in Section 4, we consider confidence estimation via a median trick that 
leads to an exponential inequality and we argue that our nonasymptotic bounds 
are not far off the asymptotic approximation based on the CLT. 

Finally, we proceed to express the bounds in terms of computable quanti- 
ties. In Section 5 we consider uniformly ergodic chains, where bounding the 
asymptotic variance is straightforward. Moreover, in case of a bounded target 
function we can compare our approach to the well known exponential inequali- 
ties for Docblin chains. Our bound is always within a factor of at most 40/3 of the 
exponential inequality (where [3 is the regeneration parameter of the Doeblin 
chain), so it will turn out sharper for many examples of practical interest, where 
(3 is small. In Section 6 we assume the most general setting that motivates our 
work, namely a drift towards a small set, to replace the unknown asymptotic 
variance by known drift parameters. Our Assumption 6.1 is quite similar to 
many analogous drift conditions known in the literature, see e.g. [44, 51, 5]. For 
aperiodic chains Assumption 6.1 implies geometric ergodicity, but we do not 
need aperiodicity for our purposes. We build on some auxiliary results of [5] to 
derive bounds on the asymptotic variance. 

The nonasymptotic confidence intervals we derive are valid in particular for 
unbounded target functions and Markov chains that are not uniformly ergodic. 
Our assumptions are comparable (in some cases identical) to those required 
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for asymptotically valid confidence intervals (c.f. [28, 6, 17, 9]). Moreover the 
bounds are expressed in terms of known quantities and thus can be of interest for 
MCMC practitioners. In Section 8 we discuss connections with related results 
in literature from both applied and theoretical viewpoint. 

One of the benchmarks for development of MCMC technology is the impor- 
tant hierarchical Bayesian model of variance components [53, 24, 29], used e.g. 
for small area estimation in survey sampling and in actuarial mathematics. We 
illustrate our theoretical results with a simple example which can be regarded 
as a part of this model. Since the analytic solution is known in this example, 
it is possible to assess the tightness of our bounds. The full model of variance 
components will be considered in [35] and [36]. 



2. Regenerative simulation 



Let 7r be a probability distribution on a Polish space X. Consider a Markov 
transition kernel P such that ttP = tt, that is 7r is stationary with respect to 
P. Assume P is 7r-irreduciblc. The regeneration/split construction of Nummelin 
[47] and Athreya and Ney [4] rests on the following assumption. 

2.1 Assumption (Small Set). There exist a Borel set J C X of positive ir 
measure, a number (3 > and a probability measure v such that 

P(x,-) > Pl{x G J)v{-). 

Under Assumption 2.1 we can define a bivariate Markov chain (X n , T n ) on the 
space X x {0, 1} in the following way. Variable r„_i depends only on X n _i via 
P(r„_i = l|A„_i = x) = (3l(x G J). The rule of transition from (A„_i, r„_i) 
to X n is given by 

v(x n e A|r„_i = i,x„_! = x ) = u(A), 
¥(X n e A|r„_i - 0,X„_! = x) = Q(x,A), 

where Q is the normalized "residual" kernel given by 

P(x,-)-/3I{x e J)u(-) 



Q(x,-) := 



l - f3l(x e J) 



Whenever T„_i = 1, the chain regenerates at moment n. The regeneration 
epochs are 

Ti := min{n > 1 : r„_j = 1}, 
T k := min{n > T fc _x : r„_! = 1}. 

Write Tk — Tfe — T^-i- Unless specified otherwise, we assume that Xq <~ v(-) 
and therefore T := is also a time of regeneration. Symbols P and E without 
subscripts will be shorthands for P„ and K v while initial distributions other than 
v will be explicitly indicated. The random blocks 

S fe := {X Tk _ 1 ,. . .,X Th -\,T k ) 
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for k = 1,2,3,... are i.i.d. 

We assume that we can simulate the split chain (X n , L„), starting from Xq ~ 
v{-). Put differently, we are able to identify regeneration times TV Mykland et 
al. pointed out in [45] that actual sampling from Q can be avoided. Assume that 
the chain X n is generated using transition probabability P. Let v(dy) / P(x, dy) 
denote the Radon-Nikodym derivative (in practice, the ratio of densities). Then 
we can recover the regeneration indicators via 

rv 1 = i<k <!(*„_! e J)- MdX " } 



'P{X n _ u dX n ) 

where U n is a sequence of i.i.d. uniform variates independent of X n . If sampling 
from the renewal distribution v(-) is difficult then we can start the simulation 
from an arbitrary state, discard the initial part of the trajectory before the first 
time of regeneration and consider only blocks S fc for k = 2, 3, . . ., that is begin 
at T\ instead of To = 0. Thus in the regenerative scheme there is a very precise 
recipe for an "absolutely sufficient burn-in" time. 

3. Main Theorem 

Let / : X — > R be a Borel function. The objective is to compute (estimate) the 
quantity 



9 := n(f) = / n(dx)f(x). 
Jx 

We assume that 9 exists, i.e. 7r(|/|) < oo. Regenerative estimators of 9 are based 
on the block sums 

Sfc(/)== £ K X *)- 

Let us now introduce a sequential version of regenerative estimator. Fix n and 
define 

(3.1) R(n) := min{r : T r > n}. 
Our basic estimator is defined as follows. 

R(n) T K(n) -l 

(3-2) TR(n) := — E Sfc(/) = ^— E /(*)■ 

In words: we stop simulation at the first moment of regeneration past n and 
compute the usual sample average. Note that we thus generate a random number 
of blocks. Our regenerative scheme requires only as many blocks as necessary 
to make the length of trajectory at least n and the "excess time" T R ^ — n will 
be shown to be small compared to n. 



K. Latuszynski et al. / Regenerative MCMC 5 

The result below bounds the mean square error (MSE) of the estimator de- 
fined by (3.2), (3.1) and the expected number of samples used to compute it. 
Let /:=/-7r(/). 

3.3 Theorem. If Assumption 2.1 holds, E(Si(/)) 2 < oo and Erf < oo then 
and 

(ii) ET R(n) <n + n , 

where 

E( Sl (/)) 2 Erf 
- En ' n °- = Er7" L 

3.4 Corollary. Under the same assumptions, 

E(^ n) -*)»<^(l + £). 

Note that the leading term cr 2 s (/)/n in Corollary 3.4 is "asymptotically cor- 
rect" in the sense that, under our assumptions, 



liui ,,e(- V/pQ)-^ =<£,(/) and li m ^M = l. 



5.5 REMARK. Under Assumption 2.1, finiteness of E(S(/)) 2 is a sufficient and 
necessary condition for the CLT to hold for Markov chain X n and function 
/. This fact is proved in [7] in a more general setting. For our purposes it is 
important to note that cr 2 s (/) in Theorem 3.3 is indeed the asymptotic variance 
which appears in the CLT. Constant n bounds the ,,mean overshoot" or excess 
length of simulations over n. 

Proof of Theorem 3.3 (i). Note that 

R(n) 

E S fc(/) 1 «(") 



a n fc=l 

VT, 



— dk > 



E r fc fl(n) fe=1 

fe=l 

where d k := S fe (/) — 9r k = S fc (/). By the Kac theorem ([44] or [48]) we have 
ES fc (/) =mw(f) =m9, 

where 

to := Er fc = — ^n- 
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Consequently the pairs (dk,Tk) are i.i.d. with Ed fc = and Vard fc = ma^ s {f). 
Since T R ^ > n, it follows that 



E(§ TRM -9f<^E £<d . 



k=i 



Since R(n) is a stopping time with respect to Q k = o({d\, T\), . . . , (c^, Tfc)), we 
are in a position to apply the two Wald's identities. The second identity yields 




2 

.2 



Var rfi ER(n) = ma^{f) ER(n). 



But in this expression we can replace mER(n) by ETr(„) because of the first 
Wald's identity: 

R(n) 

ET R{n) = E T k = Er i ER(n) = mER(n) 

k=l 

and the claimed result follows. □ 
We now focus attention on bounding the "excess" or "overshoot" time 

A(n) := T R(n) - n. 

To this end, let us recall a classical result of the (discrete time) renewal theory. 
As before, when using symbols P and E without subscripts we refer to the chain 
started at the renewal distribution v and we write m = Et\. Let A(oo) be a 
random variable having distribution 

P (A(oo) = i) := — P(ri > i) for i = 0, 1, 2, ... . 
m 

If the distribution of t\ is aperiodic then it is well-known that A(n) — > A(oo) 
in distribution, as n — > oo, but we will not use this fact directly. Instead, we 
invoke the following elegant result. 

3.6 Proposition (Lorden [38]). 

EA(n) < 2E A(oo). 

For a newer simple proof of Lorden's inequality, we refer to [11]. Proposition 
3.6 gives us exacly what we need to conclude the proof of our main result. 
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Proof of Theorem 3.3 (ii). Write p t := P(ti = i). We have 

ea (°°) = £ Pi 

If, ^. 1 ~ jQ- - 1) 

m 2 

1 , 1 

= Et 2 - -. 

2m 1 2 

By the Lorden's theorem we obtain 

EA(n) < 2EA(oo) < — Et? - 1, 
m 

which is just the desired conclusion. □ 



4. Confidence estimation 

Although the MSE is an important quantity in its own right, it can also be 
used to construct estimates with fixed precision at a given level of confidence. 
Suppose the goal is to obtain an estimator such that 

(4.1) F(\0 - 0\ > s) < a, 

for given e > and a > 0. Corollary 3.4 combined with the Chebyshev's in- 
equality yields the following bound: 

(4-2) P(|0T RW -0|> £ )<^(l+^). 

If a is small then instead of using (4.2) directly, it is better to apply the so- 
called "median trick". This is a method introduced in 1986 in [27], later used in 
many papers concerned with computational complexity, eg. [21, 46] and further 
developed in [46] . The idea is to compute the median of independent estimates 
to boost the level of confidence. We simulate I independent copies of the Markov 
chain: 

x( j \x[ j \...,x%\... (j = l,. ..,/). 

Let 0^ be an estimator computed in jth repetition. The final estimate is 9 := 
med{0^,...,e^). We require that P(|l?« - 0\ > e) < 5 (J = 1, . . . , I) for 
some modest level of confidence 1 — S < 1 — a. This is ensured via Chebyshev's 
inequality. The well-known Chcrnoff's bound gives for odd I, 

(4.3) P (\0 - 0\ > e) < \ [45(1 - 5)f 2 = \ cxp j l - In [4<5(1 - «*)]} . 
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In this way we obtain an exponential inequality for the probability of large 
deviations without requiring the underlying variables to be bounded or even to 
have a moment generating function. It is pointed out in [46] that under some 
assumptions there is a universally optimal choice of 5. More precisely, suppose 
that the bound on ¥(\9^ —6\ > e) is of the form const/n where n is the sample 
size used in a single repetition. Then the overall number of samples nl is the 
least if we choose 5* sa 0.11969. The details are described in [46]. This method 
can be used in conjunction with our regenerative/sequential scheme. The right 
hand side of (4.2) approximately behaves like const/n. Therefore the following 
strategy is reasonably close to optimum. First choose n such that the right hand 
side of (4.2) is less than or equal to 5*. Then choose I big enough to make the 
right hand side of (4.3), with 6 = 5*, less than or equal to a. Compute estimator 
&T R{n) repeatedly, using I independent runs of the chain. We can easily see that 
(4.1)" holds if 

n> a 2 sU; +n , 

I > C 2 ln(2a) _1 and j is odd, 

where C x := 1/5* w 8.3549 and C 2 := 2/ln [45* (1 - S*)]^ 1 w 2.3147 are abso- 
lute constants. By Theorem 3.3 (ii) the overall (expected) number of generated 
samples is 

(4.4) ET R(n) l ~nl~ C a ^jp- \og{2a)-\ 

where C = C1C2 ~ 19.34 and notation Left (a, e) <~ Right(a,e) means that 
Left /Right — > 1 as a, e — > 0. To see how tight are the bounds, let us compare 

(4.4) with the familiar asymptotic approximation, based on the CLT. We obtain 

lim P(|0„ -6\>e)=a, 
for the number of samples 

(4.5) n~ °dfi [$-!(! -a/2)] 2 , 

where 6 n is a simple average over n Markov chain samples, < i>~ 1 is a quantile 
function of the standard normal distribution. Taking into account the fact that 

[$- 1 (l- a /2)] 2 ^21og(2 a )- 1 , (a^O), 

we arrive at the following conclusion. The right hand side of (4.4) is bigger 
than (4.5) roughly by a constant factor of about 10 (for small e and a). The 
important difference is that (4.4) is sufficient for an exact confidence interval 
while (4.5) only for an asymptotic one. 
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5. Bounding the asymptotic variance I - uniformly ergodic chains 

We are left with the task of bounding <J 2 s (f) and no, which appear in Theorem 
3.3, by some computable quantities in typical situations of interest. The most 
important setting for applications - that of a geometrically ergodic Markov chain 
and unbounded target function / - is deferred to the next section. Here we start 
with uniformly ergodic chains, where a direct comparison of our approach to 
exponential inequalities [22, 32] is possible. We focus on [32] which is tight in 
the sense that it reduces to the Hocffding bound when specialised to the i.i.d. 
case. 

Uniform ergodicity of a Markov chain is equivalent to 

(5.1) P h (x, ■) > (3v{-) for every x € X and some integer h>l. 

We refer to [50] or [44] for definitions of uniform and geometric ergodicity of 
Markov chains and further details related to these notions. 

In the rest of this section we assume that h = 1 and hence (5.1) reduces 
to Assumption 2.1 with J = X. This is the typical situation in applications. 
If h > 1 then P h inherits the ergodic properties of P and one can use it for 
sampling. However, we acknowledge that if P h is used, the identification of 
regeneration times can be problematic since the term P h (x,dy) - needed to 
execute the Mykland et al. trick - will be typically intractable. 

Computing no in the setting of this Section is clearly trivial, since the over- 
shoot is distributed as a geometric random variable with parameter (3. 

The problem of bounding the asymptotic variance under (5.1) was considered 
in [7] . Using results of their Section 5 with h = 1 and applying basic algebra we 
obtain 

(5.2) *</) < S (l + + 2i±fH?) < 4 „ VA 

where <r 2 = nf 2 is the stationary variance. 

With reversibility one can derive a better bound. An important class of re- 
versible chains are Independence Metropolis-Hastings chains (see e.g. [50]) that 
are known to be uniformly ergodic if and only if the rejection probability r(x) 
is uniformly bounded from 1 by say 1 — (3. This is equivalent to the candidate 
distribution being bounded below by f3w (c.f. [43, 3]) and translates into (5.1) 
with h = 1 and v{-) := 7r(-). In this setting and using reversibility Atchade and 
Perron [3] show that the spectrum of P, say S, is contained in [0,1-/3]. For the 
general case of reversible chains satisfying (5.1) with h = 1 results of [49] lead 
to S C [— 1 + (3, 1 — 0\. By the spectral decomposition theorem for self adjoint 
operators (see e.g. [20, 30]) in both cases we have 

(5-3) aid) < J s l^EfAds) < 

where -E/.p is the spectral measure associated with / and P. The formula for 
°as(/) m (5-2) and (5.3) depends on (3 in an optimal way. Moreover (5.3) is 
sharp. To see this consider the following example. 
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5.4 EXAMPLE. Let (3 < 1/2 and define a Markov chain (X n ) n > on X = {0, 1} 
with stationary distribution ir = {1/2, 1/2} and transition matrix 

1 - (3/2 (3/2 
(3/2 1 - (3/2 

Hence P = (3n + (1 - /3)7 2 and P(a:,-) > /3w. Moreover let f(x) = x. Thus 
a 2 = 1/4. Now let us compute cr 2 s (/). 

00 

<£(/) = a 2 + 2^Cov{/(Xo),/(X i )} 

i=l 

To obtain an upper bound on the total simulation effort needed for F(\6— 9\ > 
e) < a for our regenerative-sequential-mcdian estimator 8, we now combine (5.2) 
and (5.3) with (4.4) to obtain respectively 

(5.5) 19.34 ^Jlog(2a)- 1 and 19.34 (2 log(2 a )- 1 . 

From Section 4 and Example 5.4 we conclude that in (5.5) the form of functional 
dependence on all the parameters is optimal. 

For / bounded let ||/|| S p := sw P x ex f( x ) ~ m ^xex f{x) and consider the 
exponential inequality for uniformly ergodic chains from [32]. For the simple 
average over n Markov chain samples, say 9 n , for an arbitrary starting point x, 
we have 

After identifying leading terms in the resulting bound for the simulation effort 
required for P(\6 n — 9\ > e) < a and assuming, to facilitate comparisons, that 
4c 2 = ll/Usp, we see that 

(5.6) n-l^log^a)" 1 . 

Comparing (5.5) with (5.6) yields a ratio of 40/3 or 20(3 respectively. This 
in particular indicates that the dependence on (3 in [22, 32] probably can be 
improved. We note that in examples of practical interest (3 usually decays ex- 
ponentially with dimension of X and our approach will often result in a lower 
total simulation cost. Moreover, in contrast to exponential inequalities of the 
classical form, our approach is valid for an unbounded target function /. 
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6. Bounding the asymptotic variance II - drift condition 

In this Section we bound <J 2 s (f) an d appearing in Theorem 3.3, by com- 
putable quantities under drift condition and with possibly unbounded /. Using 
drift conditions is a standard approach for establishing geometric ergodicity and 
our version is one of many equivalent drifts appearing in literature. Specifically, 
let J be the small set which appears in Assumption 2.1. 

6.1 Assumption (Drift). There exist a function V : X — > [l,oo[, constants 
A < 1 and K < oo such that 



PV 2 (x) := f P(x,dy)V 2 (y) < 
Jx 



\ 2 V 2 (x) forx^J, 
K 2 for x e J, 



Unusual notation in the above drift condition is chosen to simplify further 
statements. Note that Assumption 6.1 entails 



(6.2) PV(x) < 



XV (x) for x J, 
K for x e J, 



because by Jensen's inequality PV(x) < yJPV 2 {x). This simple observation 
is also exploited in [33] and [34, 35]. Assumptions 2.1 and 6.1 will allow us to 
derive explicit bounds on <r 2 s (f) and n in terms of A, (3 and K, provided that 
function f/V is bounded. 

To simplify notation, let us write T := min{n > 1 : r„_i = 1} for the first 
time of regeneration and S := Si for the first block. In contrast with the previous 
section, we will consider initial distributions of the chain different from v and 
often equal to 7r, the stationary measure. The following proposition appears e.g. 
in [48] (for bounded g). The proof for nonnegative g is the same. 

6.3 Proposition. For g : X — > [0, oo[, 



EE{g) 2 = m 



E n g(X a ) 2 + 2 J2 ^9(X )g(X n )I(T > n) 

71=1 



Our approach is based on the following result which is a slightly modified 
special case of Propositions 4.1 and 4.4 in [5], see also [39]. To make the paper 
reasonably self-contained we include the proof in Appendix A. 

6.4 Proposition. If (6.2) holds, then 

71=1 V ' 

Let us mention that a result similar to Theorem 6.5 can also be obtained 
using methods borrowed from [15], c.f. also [52], instead of [5]. Although in 
the cited papers the inequalities are derived for coupling, they could easily be 
modified to work in the context of regeneration. We will not pursue this, because 
Proposition 6.4 is easier to apply. 

The main result in this section is the following. 
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(i) Under Assumption 2.1 and (6.2), constant n in Theorem 3.3 satisfies 
~Xtt(V)-X K-X 
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n < 2 



+ 



- 1 



1-A 13(1 -X) 
(ii) If moreover \f(x)\ < V(x) then the asymptotic variance cr 2 s (/) satisfies 

~K -A-/3" 



0(1 -A) 



<V). 



Proof, (i) We apply Proposition 6.3 to g(x) = 1. Indeed, ET 2 = ES(1) 2 and 
ET 2 /m = ES(l) 2 /m 

T-l 

< 1 + 2E T ]T V(X n ) 



n=l 



< 1 + 2 



MV)-1 i K-X 
1-A /?(1-A) 



by Proposition 6.4. The result follows because n = ET 2 /m — 1. 
(ii) By Proposition 6.3 we have 

o-l(f) = ES(/) 2 /m < EE(V) 2 /m 

T-l 

= E n V(X ) 2 + 2E T V(X )V(X n ) 

n=l 

We will use Proposition 6.4 to bound the second term. 

T-l T-l 



E x ^ V(X MX„) = E,V(X )E(^ V(X n )\X a ) 

71=1 

T— 1 

= / 7r(dx)T/( 2 ;)E ;c ^ V(X„) 

n=l 



n=l 



K-X 



1-A 



^ 2 ) + 



K — X — X/3 

W - A) 



Putting everything together, we obtain 



2A 



( rL(/)<^ 2 ) + T 3 I 7r(T/ 2 ) + 2 



7r(V). 



if - A — A/3 
- A) 



7r(n 



which is equivalent to the desired conclusion. 



□ 
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Note that in Theorem 6.5 we need only (6.2), that is the drift condition on 
V. Asssumption 6.1 is needed only to get a bound on ir(V 2 ). Indeed, it implies 
that irV 2 = irPV 2 < X 2 (irV 2 - tt(J)) + K 2 ir(J), so 

2 , K 2 -X 2 K 2 -\ 2 
TtV 2 <n(J)- — < 



1-A 2 " 1-A 2 
Analogously, (6.2) implies 



irV < tt(J) — 



K -A K — X 



< 



A - 1 - A ' 
Our final estimates are therefore the following. 
6.6 Corollary. 

(i) Under Assumptions 2.1 and 6.1, 

2 



n < 



(1 - X)[3 



1-A ' V 1-A 



(it) If \\f\\v ■= sup x \f(x)\/V(x) < oo then 



2 f 2 K 2 {2 -Kg) - 2K{2\ + 13) + 2A 2 + 2A/3 - A 2 /3 

O-asU) < \\J\\V ^ _ X yp ' 

(Hi) Moreover, can be related to by 



WfWv < ll/Hv + n X) y( , < ll/lk + 

(1 — A) vat X £x 1-A 

Proof. To prove (iii) we compute 

- \f(x)-nf\ \f(x)\ + \nf\ 



xexV V(x)J-" J " v (l-\)mf xeX V(x) 

K-X 



< fv + 



X ' 



□ 



6. 7 REMARK. In many specific examples one can obtain (with some additional 
effort) sharper bounds for 7rV, ttV 2 , \\f\\v or at least bound n(J) away from 1. 
However in general we assume that such bounds are not available and one will 
use Corollary 6.6. 
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7. Example 



The simulation experiments described below are designed to compare the bounds 
proved in this paper with actual errors of MCMC estimation. Assume that 
y = (yi, . . . , y t ) is an i.i.d. sample from the normal distribution N(/i, k -1 ), where 
k denotes the reciprocal of the variance. Thus we have 



p(j/Im. «) = p(yi, ■ ■ ■ , yt\n, «) «* /2 



exp 



The pair (/x, k) plays the role of unknown parameter. To make things simple, 
let us consider "uninformative improper priors" that is assume that p(p,, k) = 
p([i)p(k) cx k . The posterior density is then 



p{n,n\y) cx p(y\n, n)p(fi, k) 
x K f/2_1 exp 



(s 2 + (y-») 2 ) 



where 



1 * 1 * 



Note that y and s 2 only determine the location and scale of the posterior. We 
will be using a Gibbs sampler, whose performance does not depend on scaling 
and location, therefore without loss of generality we can assume that y = and 
s 2 = t. Since y = (y\, . . . , y t ) is kept fixed, let us slightly abuse notation by using 
symbols p(n\n), p{p\n) and p(p) for p{n\fJL, y), p(p\n, y) and p(p\y), respectively. 
Now, the Gibbs sampler consists of drawing samples intermittently from both 
the conditionals. Start with some (/^o, kq). Then, for i — 1,2,..., 

• Ki ~ Gamma (i/2, (t/2)(s 2 + M 2 -i)), 

• w ~N(0,l/(Ki*))- 

If we are chiefly interested in \i then it is convenient to consider the two small 
steps — > — > /x, together. The transition density is 



p(MilMi-i) = j p{^\n)p{n\^i)d 

POO 

cx / k 1 / 2 ex 



x ( S 2 + ^-i) t/2 « t/2 ^ 1 exp 



/•OO 

(s' + pUf ^- 1)/2 exp 
Jo 



(« 2 + m 2 -0 



d« 



(s 2 + Mi-l + Mi) 



dK 



cx 



( S 2 + M 2 _ 1 )* /2 ( S 2 +/^ 1 +M 2 ) 



2 .-(i+l)/2 
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The proportionality constants concealed behind the oc sign depend only on t. 
Finally we fix scale letting s 2 = t and get 

/ u 2 N */2 / 2 2x -(t+i)/2 
(7.1) p( Mi | Mi -i)cx(l+^) (1 ' N 



t j v f * 

If we consider the RHS of (7.1) as a function of /Xj only we can regard the first 
factor as constant and write 

/ / 2 x-1 2 \ _( * +1)/2 

P(wk-i)a(l+(l+^ij 
It is clear that the conditional distribution of random variable 
(7.2) ^1+ / 

is t-Student distribution with t degrees of freedom. Therefore, since the t- 
distribution has the second moment equal to t/(t — 2) for t > 2, we infer that 

Similar computation shows that the posterior marginal density of \i satisfies 
t-1 /i 2 



2 \ -1/2 
Mi-1 



,2 \ -t/2 



P (M)oc(^l+ t 

Thus the stationary distribution of our Gibbs sampler is rescaled t-Student with 
t — 1 degrees of freedom. Consequently we have 

7.3 Proposition (Drift). Assume that t > 4. Le£ 
y 2 ( M ) :=/i 2 + l 

and J = [—a, a]. The transition kernel of the (2-step) Gibbs sampler satisfies 

'\ 2 V 2 (n) for H > a; 
K 2 for M < a, 

provided that a > y/t/(t — 3). T/ie quantities X and K are given by 



py 2 ( M ) < 



t- 2 
Moreover, 

v ; t-3 
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Proof. It is enough to use the fact that 

PV 2 (p) = E( M 2 + = v) = y^y + 1 

and some simple algebra. Analogously, ^{V 2 ) — E^p 2 + 1. □ 
7.4 Proposition (Minorization). Let p m i n be a subprobability density given by 

\p{iA a ) f or H < 

\p(A*|0) /or > h{a), 
where p{-\-) is the transition density given by (7.1) and 

1/2 



Pmin(At) 



h(a) = < a 



2\ «/(*+!) 



TTien < a implies p{pi\pi-\) > Pmin{Pi)- Consequently, if we take for v 

the probability measure with the normalized density p m i n /f3 then the small set 
Assumption 2.1 holds for J = [—a, a]. Constant (3 is given by 

/J=l-P(|0|<fc(a))+P^|l?| < (l+y) 1 - 

where $ is a random variable with t-Student distribution with t degrees of free- 
dom. 

Proof. The formula for p m ; n results from minimization of p(pi \pi-i) with respect 
to e [—a, a]. We use (7.1). First compute (d/d/Zj_i)p(/Uj|/ij_i) to check that 
the function has to attain minimum cither at or at a. Thus 



Pmin(M) 



p(p\a) if p(p\a) < p{p\0); 
p(/i|0) ifp(/i|a)>p(/i|0). 



Now it is enough to solve the inequality, say, p(p\a) < p(p\0) with respect to p. 
Elementary computation shows that this inequality is fulfiled iff p < h(a). The 
formula for (3 follows from (7.2) and from the fact that 

(3= Pmin(p)dp= / p(p\a)dp+ / p(p\0)dp. 

J J\ti\<h(a) J\n\>h(a) 

□ 

7.5 REMARK. It is interesting to compare the asymptotic behavior of the 
constants in Propositions 7.3 and 7.4 for a — > oo. We can immediately see that 
A 2 — > l/(i — 2) and if 2 ~ a 2 /(t — 2). Slightly more tedious computation reveals 
that h(a) ~ const • a}^ t+1 ^ and consequently f3 ~ const • a~*/( t+1 ). 
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The parameter of interest is the posterior mean (Bayes estimator of /z). Thus 
we let /(u) — fi and 9 = E^ti. Note that our chain zt , • • • , Hi, ■ ■ ■ is a zero-mean 
martingale, so / = / and 

Obviously we have ||/||y = 1. 

In the experiments described below, t = 50 is kept fixed. Other experiments 
(not reported here) show that the value of t has little influence on the results. 
Table 1 illustrates inequalities in Theorem 3.3 and Corollary 3.4. The actual 
values of the MSE of our estimator and the mean overshoot, viz. 

MSE := E (0 Tjl(n) -Of, 
OS:=ET fl(n) -n, 

are computed empirically, using 10000 repetitions of the experiment. They can 
be compared with the bounds in 3.3 and 3.4, named henceforth 

BoundMSE := ^ (l + ^ 
n V n I 

Et 2 

BoundOS := n = — - 1. 

En 

In these formulas, we use the true value of cr 2 s (/), for which we have an an- 
alytical expression. Also m = En = tt(J)(3 is computed exactly while Erf is 
approximated via a separate (very long) series of simulations, given for two 
choices of the "small set" J = [—a, a]. We also show values of m (mean length 
of a regeneration cycle) and /3 (probability of regeneration) . 



n 


a 


MSE 


BoundMSE 


OS 


BoundOS 


m 


P 


10 


5 


0.1062 


0.1087 


0.1099 


0.2134 


1.1072 


0.9032 


100 


0.0105 


0.0107 


0.1037 


1000 


0.0011 


0.0011 


0.1073 


10 


100 


0.0821 


0.2247 


5.4768 


11.1196 


6.5043 


0.1537 


100 


0.0102 


0.0118 


5.4871 


1000 


0.0011 


0.0011 


5.4337 



Table 1. Actual values of the MSE and mean overshoot vs. bounds 3.3 and 3.4 

Table 1 clearly shows that the inequalities in Theorem 3.3 are quite sharp. 
The bound on MSE, which is of primary interest, becomes almost exact for large 
n. The bound on the mean overshoot, which can be used to estimate the cost 
of the algorithm, is also very satisfactory. 

We now proceed to the inequalities proved in Section 6 under the drift condi- 
tion, Assumption 6.1. The final bounds in Corollary 6.6 are expressed in terms 
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of the computable drift /minorization parameters, that is A, K and (3. We also 
examine how the tightness of the final bounds is influenced by replacing the true 
value of irV 2 by its upper bound. To this end we compute the bounds given in 
Theorem 6.5, using the knowledge of ttV 2 . In our example we compute A, K, (3 
and also ttV 2 via Propositions 7.3 and 7.4 for different choices of J = [—a, a]. 
Parameter t — 50 is fixed. 

Figure 1 shows how the two bounds on & 2 s (f) depend on a. The black line 
corresponds to the bound of Corollary 6.6 (ii) which involves only A, K and (3. 
The grey line gives the bound of Theorem 6.5 (ii) which assumes the knowledge 
of ttV 2 . The best values of both bounds, equal to 7.19 and 5.66, correspond to 
a = 3.93 and a — 4.33, respectively. The actual value of the asymptotic variance 
is al(f) = 1.064. 




a 



Figure 1. Bounds for the asymptotic variance cr^ s (/) as functions of a. 

Figure 2 is analogous and shows two bounds on n . Again, the black bound 
involves only the drift/minorization parameters while the grey one assumes the 
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knowlegde of ttV 2 . The best bounds, 2.94 and 2.50, obtain for a = 4.73 and 
a = 4.33, respectively. 




a 



Figure 2. Bounds for hq as functions of a. 

In contrast with the inequalities of Theorem 3.3, the bounds of Theorem 6.5 
depend significantly on t, the size of sample behind the posterior distribution. 
In Table 2 below we summarize the best (with respect to a) bounds on cr^ s (/) 
for three values of t. 



t 


cr'W) 


Bound 6.5 (ii) 


Bound 6.6 (ii) 


5 


2.500 


141.50 


41.02 


50 


1.064 


7.19 


5.66 


500 


1.006 


4.33 


3.99 



Table 2. Values of cr^ s (/) vs. bounds 6.5 and 6.6 for different values of t. 
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This clearly identifies the bottleneck of the approach: the bounds on ct^ s (/) 
under drift condition in Theorem 6.5 and Corollary 6.6 can vary widely in their 
sharpness in specific examples. We conjecture that this may be the case in 
general for any bounds derived under drift conditions. Known bounds on the rate 
of convergence (e.g. in total variation norm) obtained under drift conditions are 
often very pessimistic, too (e.g. [5, 51, 29]). However, at present, drift conditions 
remain the main and most universal tool for proving computable bounds for 
Markov chains on continuous spaces. An alternative might be working with 
conductance but to the best of our knowledge, so far this approach has been 
applied successfully only to examples with compact state spaces (see e.g. [55, 42] 
and references therein). 

8. Connections with other results 

Our aim was to obtain nonasymptotic results concerning the mean square error 
and confidence estimation in a possibly general setting relevant for MCMC 
applications in Bayesian statistics. We now discuss our results in context of 
related work. 

8.1. Related nonasymptotic results 

A vast literature on nonasymptotic analysis of Markov chains is available in 
various settings. To place our results in this context we give a brief account, 
which by no means is extensive. In the case of finite state space, an approach 
based on the spectral decomposition was used in [2, 21, 37, 46] to derive results of 
related type. For bounded functions and uniformly ergodic chains on a general 
state space, exponential inequalities with explicit constants such as those in 
[22, 32] can be applied to derive confidence bounds. Comparison of the required 
simulation effort for the same confidence interval (Sections 4 and 5) shows that 
while exponential inequalities have sharper constants, our approach gives in 
this setting the optimal dependence on the regeneration rate j3 and therefore 
can turn out more efficient in many practical examples. 

Related results come also from studying concentration of measure phenomenon 
for dependent random variables. For the large body of work in this area see e.g. 
[41], [56] and [31] (and references therein), where transportation inequalities or 
martingale approach have been used. These results, motivated in a more general 
setting, are valid for Lipschitz functions with respect to the Hamming metric. 
They also include expressions sup^g^ ||-P*( X ; ■) — P l (y, -)lltv and when applied 
to our setting, they are well suited for bounded functionals of uniformly er- 
godic Markov chains, but can not be applied to geometrically ergodic chains. 
For details we refer to the original papers and the discussion in Section 3.5 of 

[!]■ 

For lazy reversible Markov chains, nonasymptotic mean square error bounds 
have been obtained for bounded target functions in [55] in a setting where ex- 
plicit bounds on conductance are available. These results have been applied to 
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approximating integrals over balls in R d under some regularity conditions for 
the stationary measure, see [55] for details. The Markov chains considered there 
are in fact uniformly ergodic, however in their problem when establishing (5.1), 
turns out to be exponentially small and h > 1, hence conductance seems to 
be the natural approach to make the problem tractable in high dimensions. 

Tail inequalities for bounded functionals of Markov chains that are not uni- 
formly ergodic were considered in [12], [1] and [16] using regeneration techniques. 
These results apply e.g. to geometrically or subgcomctrically ergodic Markov 
chains, however they also involve non-explicit constants or require tractability 
of moment conditions of random tours between regenerations. Computing ex- 
plicit bounds from these results may be possible with additional work, but we 
do not pursue it here. 

Tail inequalities for unbounded target function / that can be applied to geo- 
metrically ergodic Markov chains have been established by Bertail and Clcmcncon 
in [10] by regenerative approach and using truncation arguments. However they 
involve non-explicit constants and can not be directly applied to confidence 
estimation. 

Rates of convergence of geometrically ergodic Markov chains to their station- 
ary distributions have been investigated in many papers. The typical setting 
is similar to our Section 6, i.e. one assumes a geometrical drift to a small set 
and a one step minorization condition. Moreover, to establish convergence rates 
one requires an additional condition that implies aperiodicity, which was not 
needed for our purposes. Most of the authors focus either on the total variation 
distance [50, 52, 51, 29, 53] or its weighted version [18, 5]. Such results, al- 
though of utmost theoretical importance, do not directly translate into bounds 
on the accuracy of estimation, because they allow us to control only the bias 
of estimates and the so-called burn-in time. Moreover we note, that in the drift 
condition setting 

• convergence to stationarity is in fact not needed, we only require a bound 
on the asymptotic variance and on the overshoot, c.f. Section 6, 

• to obtain explicit convergence rates, some version of our Proposition 6.4 
is always needed (c.f. for example Section 4 of [5]) and it is in fact one of 
several steps required for the bound, whereas we are using Proposition 6.4 
directly, avoiding other steps that could weaken the results. 

8.2. Nonasymptotic vs asymptotic confidence estimation 

Since nonasymptotic analysis of complicated Markov chains appears difficult, 
practitioners often validate MCMC estimation by a convergence diagnostics 
(see e.g. [13, 19] and references therein). It is however well-known that this 
may lead to overoptimistic conclusions, stopping the simulation far to early, 
and introducing bias [14, 40]. Designing asymptotic confidence intervals based 
on CLTs for Markov chains is often perceived as a reasonable trade-off between 
rigorous analysis of the algorithm and convergence heuristics and is referred to 
as honest MCMC estimation, c.f. [20, 25]. 
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In what follows we argue that the nonasymptotic confidence estimation pre- 
sented in the current paper requires verifying essentially the same assumptions 
as asymptotic confidence estimation. We also compare implementational diffi- 
culties. 

Asymptotic confidence estimation for Markov chains is done e.g. by estab- 
lishing Edgeworth expansions (see [8, 9]) or by applying the Glynn and Whitt 
sequential procedure [23] in the Markov chain context. Both methods rely heav- 
ily on strongly consistent estimation of the asymptotic variance. There has been 
a lot of work done recently to analyse asymptotic variance estimators for Markov 
chains and enable strongly consistent estimation under tractable assumptions 
[17, 28, 6, 26, 9, 8]. In particular we note the following. 

• The most commonly used regenerative estimators (see e.g. [28, 26, 8, 9]) 
are known to be strongly consistent for geometrically ergodic Markov 
chains that satisfy a one step minorization condition and an integrabil- 
ity condition E 7r |/| 2+<5 < oo (Proposition 1 of [28]). 

• Similarly, the non-overlapping and overlapping batch means estimators 
(see e.g. [28, 17]) are known to be strongly consistent for geometrically 
ergodic Markov chains that satisfy a one step minorization condition and 
an integrability condition E w |/| 2+<5 < oo (Proposition 4 of [6] and Theorem 
2 of [17] respectively). 

• Spectral variance estimators are known to be strongly consistent for ge- 
ometrically ergodic Markov chains that satisfy a one step minorization 
condition and an integrability condition E„.|/| 4+(5 < oo (Theorem 1 of 
[17])- 

We note that geometrical ergodicity is typically established by a drift condition 
similar to the one used in Section 6 and the one step minorization condition 
usually boils down to our Assumption 2.1. As for integrability conditions, the 
drift condition implies irV 2 < oo and we require f 2 < V 2 . Checking E 7T \f\ 2+s < 
oo will be typically done by ensuring |/| 2+<5 < V 2 and is therefore comparable, 
whereas the condition E v \f\ A+5 < oo for spectral variance estimation is clearly 
stronger. 

On the algorithmic side, regenerative asymptotic variance estimators require 
identifying regenerations, exactly as we do, whereas batch means and spectral 
variance estimators do not require this. 

Therefore we conclude, that if regenerations are identifiable, the price for the 
rigorous, nonasymptotic result is only as high as the difference between cr 2 s (/) 
and its upper bounds e.g. those in Section 6. 
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Appendix A: Proof of Proposition 6.4 

Proof. Under (6.2) and Assumption 2.1 we are to establish 

E x ^V{X n )< 1 _ A 

n=l v ' 

The idea is to decompose the sum into shorter blocks, such that each block ends 
at a visit to J. Let S := So := min{n > : X n e J} and Sj := min{n > Sj-i : 
X n e J} for j = 1,2, Introduce the following notations: 

s 



H(x) := E x J2 V(X n ), for x€X, 

H := supE x fv y(X„) r = = sup / Q(x,dy)H(y). 
xe.J \* I xeJJ 



\n=l 

Note that H(x) = V(x) for x e J and that Q denotes the normalized "residual 
kernel" . 

Let us first bound H(x). It is easy to check that under (6.2), for every initial 
distribution, V(X nAS )/X nAS for n = 0, 1, . . . is a supermartingalc with respect 
to T n := a{X , . . .,X n ). Therefore E x V{X nAS ) / X nAS < V(x) for every i£l 
and n = 0,1,.... This inequality can be multiplied by A™ and rewiritten as 
follows: 

E x V{X s )\ n - s I(S < n)+E x V{X n )I{n < S) < X n V(x). 
Now take a sum over n = 0, 1, . . . to obtain 

oo 5 oo 

E X V{X S ) ]T \ n - s + E x Y,V(X n ) <V(x)Y,^ n 

n=S+l n=0 n=0 

or, equivalently, 

(A.l) E x V(X s ) T ^+H(x)<V(x) I ^j. 

Consequently, since E x V(Xs) > 1, we have for exery x, 
V(x) - A 



(A.2) H{x) < 



1 - A 



From (6.2) we obtain PV(x) = (1 - (3)QV(x) + [3vV < K for x e J, so 
QV(x) <{K- f3)/(l - j3) and, taking into account (A.2), 

rA ^ (A -/?)/(!-/?) -A K-\-(3(l-\) 

(A - 3) H - —x = (i-m-P) ■ 
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Recall that T := min{ra > 1 : r n _i = 1}. For x € J we thus have 

T-1 

Y, V ( X n) 
n=l 

oo Sj 

= E * E E n*n)i(r So = • • • = IV, = o) 

j=l n=S 3 _!+l 
oo / 

j=l \n=S 3 _i + l 



r s , = - = r s ,_, =o (l-ffl' 



by (A. 3). For x J we have to add one more term at the beginning: 

T-1 So 

E x J2 V(X n ) =E X J2 

n— 1 n— 1 

oo Sj 

+ E *E E n^n)i(r So = --- = r Si _ 1 =o). 

j = l n =Sj-! + l 

This extra term is equal to H(x) — V(x) and we can use (A. 2) to bound it. 
Finally we obtain 

(A.4) E x £ V(X n ) < X <yW~\ x * J) + - 1. 

□ 
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